Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove usage of cluster level setting for circuit breaker #2567

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

jmazanec15
Copy link
Member

Description

Simplification of circuit breaker management. Before, we were periodically updating the circuit breaker cluster setting to have a global circuit breaker. However, this is inconvenient and not actually useful.

This change removes that logic and only trips based on node level circuit breaker. For knn stats, in order to fetch the cluster level circuit breaker, a transport call is made to check all of the nodes. This isnt super efficient but its made for stats calls so its not on the critical path.

Overall, this greatly simplfiies the code. However, it does mean that the cluster setting "knn.circuit_breaker.triggered" will no longer have an effect. But, I think for 3.0 this is okay. For users interested in this info, they can get it from the stats API.

Still fixing up tests pending, but I wanted to get opinions early on the change before investing that time to fix.

Related Issues

Resolves #1607

Check List

  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@jmazanec15 jmazanec15 added Refactoring Improve the design, structure, and implementation while preserving its functionality v3.0.0 labels Feb 27, 2025
Simplification of circuit breaker management. Before, we were
periodically updating the circuit breaker cluster setting to have a
global circuit breaker. However, this is inconvenient and not actually
useful.

This change removes that logic and only trips based on node level
circuit breaker. For knn stats, in order to fetch the cluster level
circuit breaker, a transport call is made to check all of the nodes.
This isnt super efficient but its made for stats calls so its not on the
critical path.

Signed-off-by: John Mazanec <[email protected]>
}
}
};
this.threadPool.scheduleWithFixedDelay(runnable, TimeValue.timeValueSeconds(CB_TIME_INTERVAL), ThreadPool.Names.GENERIC);
threadPool.scheduleWithFixedDelay(runnable, TimeValue.timeValueSeconds(CB_TIME_INTERVAL), ThreadPool.Names.GENERIC);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could go even a step further an move this to the native cache manager to simplfiy things more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change - I think its fairly straightforward and completely removes the potential circular dependency.

@Override
public Boolean get() {
try {
KNNCircuitBreakerTrippedResponse knnCircuitBreakerTrippedResponse = client.execute(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: This is going to have to somehow be made to be async, which will complicate things.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change. Its a bit messy - will clean up a bit.

import java.util.Map;
import java.util.function.Function;

public class CircuitBreakerStat extends KNNStat<Boolean> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: In mixed cluster state, this should fall back to reading from the (deprecated) index setting.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed this in latest revision

Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
@jmazanec15 jmazanec15 force-pushed the issue-1607 branch 2 times, most recently from b910f38 to df687f7 Compare February 27, 2025 21:05
Signed-off-by: John Mazanec <[email protected]>
@kotwanikunal
Copy link
Member

@jmazanec15 - I like the overall idea of getting rid of the cluster level breaker and relying on the node level breakers.
The concern I have here is - we still trip up the entire cluster instead of just the node. What do you think of making it act up per node?
It heads in the direction of rejecting per node requests based on the current node load, and we can make the coordinator rely on retrying on another node instead.

It will get us closer to the backpressure constructs: https://opensearch.org/blog/shard-indexing-backpressure-in-opensearch/

@@ -500,4 +486,21 @@ private void startMaintenance(Cache<String, NativeMemoryAllocation> cacheInstanc

maintenanceTask = threadPool.scheduleWithFixedDelay(cleanUp, interval, ThreadPool.Names.MANAGEMENT);
}

private void circuitBreakerUpdater() {
Runnable runnable = () -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the previous approach was directly in the codepath. Are we moving to runnable tasks because we want a mechanism to unset the CB?

I would rather have it as a side effect than a constant monitor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was actually in the KNNCircuitBreaker class: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/KNNCircuitBreaker.java#L60-L109. I moved it to the cache manager because it removes the circular dependency between the 2. However, the logic remains consistent

@Override
public ActionListener<Void> setupContext(KNNStatFetchContext knnStatFetchContext, ActionListener<Void> actionListener) {
// If there are any nodes in the cluster before 3.0, then we need to fall back to checking the CB
if (minVersionSupplier.get().compareTo(Version.V_3_0_0) < 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we use the compare from Version class directly? minVersionSupplier.get().orOrBefore(Version.V_3_0_0)

@jmazanec15
Copy link
Member Author

The concern I have here is - we still trip up the entire cluster instead of just the node. What do you think of making it act up per node?

@kotwanikunal Why do you say that? We only trip up per node.

Signed-off-by: John Mazanec <[email protected]>
@kotwanikunal
Copy link
Member

The concern I have here is - we still trip up the entire cluster instead of just the node. What do you think of making it act up per node?

@kotwanikunal Why do you say that? We only trip up per node.

The concern I have here is - we still trip up the entire cluster instead of just the node. What do you think of making it act up per node?

@kotwanikunal Why do you say that? We only trip up per node.

CircuitBreakerStat is still marked as a cluster level stat. I was wondering if we could get away from maintaining it at the cluster level all together.

@jmazanec15
Copy link
Member Author

jmazanec15 commented Feb 28, 2025

@kotwanikunal Oh I see. Yeah I figured thatd be a breaking change and so couldnt remove it without deprecation. Hence, the stats update. I might be mistaken

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Refactoring Improve the design, structure, and implementation while preserving its functionality v3.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove cluster setting for native memory circuit breaker
2 participants